This notebook was created for the analysis of, and prediction on, the Default of Credit Card Clients data set from the UCI Machine Learning Repository. The data set can be accessed separately from the UCI Machine Learning Repository page, here.
In their paper "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients" (Yeh, I. C. & Lien, C. H., 2009), which can be found here, Yeh and Lien review six data mining techniques (discriminant analysis, logistic regression, Bayes classifier, nearest neighbor, artificial neural networks, and classification trees) and their applications to credit scoring. Then, using real cardholders' credit risk data from Taiwan, they compare the classification accuracy among them.
In another paper titled "Machine Learning Approaches to Predict Default of Credit Card Clients" (Liu, R. L., 2018), which can be found here, Liu compares traditional machine learning models, i.e. Support Vector Machine, k-Nearest Neighbors, Decision Tree and Random Forest, with a Feedforward Neural Network and Long Short-Term Memory.
Below is a description of the attributes that will be used in our model, for a better understanding of the data:
- LIMIT_BAL: Amount of the given credit (NT dollar). It includes both the individual consumer credit and his/her family (supplementary) credit.
- SEX: Gender (1 = male; 2 = female).
- EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
- AGE: Age (years).
- PAY_1 to PAY_6: Repayment status in September, August, July, June, May and April 2005, respectively. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- BILL_AMT1 to BILL_AMT6: Amount of bill statement (NT dollar) in September, August, July, June, May and April 2005, respectively.
- PAY_AMT1 to PAY_AMT6: Amount of previous payment (NT dollar) in September, August, July, June, May and April 2005, respectively.
- dpnm: Default payment next month (Yes = 1, No = 0).

We will create 3 models in order to make predictions and compare them with the original paper. These models are:

1. Logistic Regression
2. Decision Tree
3. Neural Network (Multi-layer Perceptron)
In order to be consistent with the original paper and have the same basis for our results, we will use the same metrics: accuracy and F1 score. Accuracy is $\frac{\text{number of correct predictions}}{\text{number of samples}}$. When the data set is imbalanced, accuracy may not be sufficient, because simply predicting the majority class for all samples can still yield high accuracy. In such situations, a better metric is the F1 score, calculated as $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, where precision is $\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$ and recall is $\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$. Precision measures a model's ability to correctly identify positive samples, and recall measures the proportion of positive samples that are identified. The F1 score ranges from 0 (no true positive predictions) to 1 (perfect precision and recall).
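To make the formulas concrete, here is a small sketch computing all four metrics from the cells of a confusion matrix (the counts are made up for illustration only):

```python
# Made-up confusion-matrix counts for illustration only.
tp, fp, fn, tn = 30, 10, 20, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.70
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.60
f1 = 2 * precision * recall / (precision + recall)  # 0.67

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
```

Note how the F1 score (0.67) sits below the accuracy (0.70) here, because the false negatives drag recall down even though overall accuracy looks reasonable.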
In addition to the above, we compute the confusion matrix for each model as well as the Area Under the Curve (AUC), and plot the ROC curves. A ROC curve plots the False Positive Rate (FPR) on the X-axis against the True Positive Rate (TPR) on the Y-axis. The AUC is the area under this curve: the larger the area, the better the model is at distinguishing the given classes. The ideal AUC value is 1, while 0.5 corresponds to random guessing.
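As a quick sanity check of the definition, here is a minimal sketch with four toy labels and scores (the values are illustrative, not from our data set):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC: %.2f' % roc_auc_score(y_true, y_score))  # AUC: 0.75
```

The AUC of 0.75 matches the pairwise interpretation: of the four (negative, positive) pairs, the positive sample receives the higher score in three.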
Using the models we created, we will try to predict the class value of the dpnm column with better scores (accuracy and F1) than those presented in the two papers.
### General libraries ###
import pandas as pd
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import graphviz
from graphviz import Source
from IPython.display import SVG
import os
##################################
### ML Models ###
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import export_text
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
##################################
### Metrics ###
from yellowbrick.classifier import ConfusionMatrix
from sklearn import metrics
from sklearn.metrics import f1_score,confusion_matrix, mean_squared_error, mean_absolute_error, classification_report, roc_auc_score, roc_curve, precision_score, recall_score
In this section we will load the data from the CSV file and check for any "impurities", such as null values or duplicate rows. If any of these appear, we will remove them from the data set. We will also plot the correlations of the class column with all the other columns.
# Load the data.
data = pd.read_csv('default of credit card clients.csv')
# Information
data.info()
Since the ID column is for indexing purposes only, we remove it from the data set.
# Drop "ID" column.
data = data.drop(['ID'], axis=1)
# Check for null values.
print(
    f"There are {data.isna().sum().sum()} cells with null values in the data set.")
Below is the plot of the correlation matrix for the data set.
# Plot of the correlation matrix for the data set
plt.figure(figsize=(20, 20))
sns.heatmap(data.corr(), annot=True, cmap='rainbow',
            cbar=False, linewidth=0.5, fmt='.2f')
plt.title('Correlation Matrix')
On the correlation matrix we can see that the columns BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5 and BILL_AMT6 are highly correlated (>0.90) with BILL_AMT1. Because of that, we can exclude them from our models and keep only BILL_AMT1, as we will see later.
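The same check can be done programmatically instead of reading the heatmap by eye. A minimal sketch (the helper name is our own, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.90):
    """Return (col_a, col_b, corr) for column pairs with |correlation| above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 drops the diagonal), so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 2))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]
```

Applied to `data`, this should list BILL_AMT1 paired with each of BILL_AMT2 to BILL_AMT6, matching what the heatmap shows.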
In this part we prepare our data for our models. This means that we choose the columns that will be our independent variables and the column whose class we want to predict. Once we are done with that, we split our data into train and test sets and perform standardization on them.
# Check for duplicate rows.
print(
    f"There are {data[data.columns[:-1]].duplicated().sum()} duplicate rows in the data set.")
# Remove duplicate rows.
data = data.drop_duplicates()
print("The duplicate rows were removed.")
# Perform One Hot encoding on 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'.
data = pd.get_dummies(
    data, columns=['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])
# Distinguish attribute columns and class column.
# BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5 and BILL_AMT6 are excluded.
features = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'BILL_AMT1', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'PAY_1_-2', 'PAY_1_-1', 'PAY_1_0',
'PAY_1_1', 'PAY_1_2', 'PAY_1_3', 'PAY_1_4', 'PAY_1_5', 'PAY_1_6', 'PAY_1_7', 'PAY_1_8', 'PAY_2_-2', 'PAY_2_-1', 'PAY_2_0',
'PAY_2_1', 'PAY_2_2', 'PAY_2_3', 'PAY_2_4', 'PAY_2_5', 'PAY_2_6', 'PAY_2_7', 'PAY_2_8', 'PAY_3_-2', 'PAY_3_-1', 'PAY_3_0',
'PAY_3_1', 'PAY_3_2', 'PAY_3_3', 'PAY_3_4', 'PAY_3_5', 'PAY_3_6', 'PAY_3_7', 'PAY_3_8', 'PAY_4_-2', 'PAY_4_-1', 'PAY_4_0',
'PAY_4_1', 'PAY_4_2', 'PAY_4_3', 'PAY_4_4', 'PAY_4_5', 'PAY_4_6', 'PAY_4_7', 'PAY_4_8', 'PAY_5_-2', 'PAY_5_-1', 'PAY_5_0',
'PAY_5_2', 'PAY_5_3', 'PAY_5_4', 'PAY_5_5', 'PAY_5_6', 'PAY_5_7', 'PAY_5_8', 'PAY_6_-2', 'PAY_6_-1', 'PAY_6_0', 'PAY_6_2',
'PAY_6_3', 'PAY_6_4', 'PAY_6_5', 'PAY_6_6', 'PAY_6_7', 'PAY_6_8']
X = data[features]
y = data['dpnm']
# Split to train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=251)
# Standardization
scaler = StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to both sets.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In this section we build and try 3 models: Logistic Regression, Decision Tree and Neural Network (Multi-layer Perceptron).
Each model will be trained and make a prediction for the test set. Accuracy, f1 score, confusion matrix and ROC AUC will be calculated for each model.
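Since the same evaluation block is repeated for every model below, the pattern can be factored into a small helper. This is a sketch of our own (the function name and return format are not from the original notebook), assuming an already-fitted binary classifier with `predict_proba`:

```python
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score)

def evaluate(model, X_test, y_test):
    """Compute the metrics reported below for an already-fitted classifier."""
    pred = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]  # positive-class probabilities
    return {'accuracy': model.score(X_test, y_test),
            'precision': precision_score(y_test, pred),
            'recall': recall_score(y_test, pred),
            'f1': f1_score(y_test, pred),
            'roc_auc': roc_auc_score(y_test, probs)}
```

For example, `evaluate(logreg, X_test, y_test)` after fitting would return all five scores in one dictionary.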
# Initialize a Logistic Regression estimator.
logreg = LogisticRegression(multi_class='auto', random_state=25, n_jobs=-1)
# Train the estimator.
logreg.fit(X_train, y_train)
# Make predictions.
log_pred = logreg.predict(X_test)
# 10-fold cross-validation score (computed on the full, unscaled data set).
logreg_cv = cross_val_score(logreg, X, y, cv=10)
# Accuracy: 1 is perfect prediction.
print('Accuracy: %.3f' % logreg.score(X_test, y_test))
# Cross-Validation accuracy
print('Cross-validation accuracy: %0.3f' % logreg_cv.mean())
# Precision
print('Precision: %.3f' % precision_score(y_test, log_pred))
# Recall
print('Recall: %.3f' % recall_score(y_test, log_pred))
# f1 score: best value at 1 (perfect precision and recall) and worst at 0.
print('F1 score: %.3f' % f1_score(y_test, log_pred))
# Predict probabilities for the test data.
logreg_probs = logreg.predict_proba(X_test)
# Keep Probabilities of the positive class only.
logreg_probs = logreg_probs[:, 1]
# Compute the AUC Score.
auc_logreg = roc_auc_score(y_test, logreg_probs)
print('AUC: %.2f' % auc_logreg)
# Plot confusion matrix for Logistic Regression.
cm = ConfusionMatrix(logreg, is_fitted=True)
cm.score(X_test, y_test)
cm.show()
# Get the ROC curves.
logreg_fpr, logreg_tpr, logreg_thresholds = roc_curve(y_test, logreg_probs)
# Plot the ROC curve.
plt.figure(figsize=(8, 8))
plt.plot(logreg_fpr, logreg_tpr, color='red',
         label='Logistic Regression ROC (AUC= %0.2f)' % auc_logreg)
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend()
plt.show()
# Initialize a decision tree estimator.
tr = tree.DecisionTreeClassifier(
    max_depth=3, criterion='gini', random_state=25)
# Train the estimator.
tr.fit(X_train, y_train)
# Plot the tree.
fig = plt.figure(figsize=(23, 15))
tree.plot_tree(tr, feature_names=X.columns,
               filled=True, rounded=True, fontsize=16)
plt.title('Decision Tree')
# Print the tree in a simplified version.
r = export_text(tr, feature_names=X.columns.tolist())
print(r)
# Make predictions.
tr_pred = tr.predict(X_test)
# 10-fold cross-validation score (computed on the full, unscaled data set).
tr_cv = cross_val_score(tr, X, y, cv=10)
# Accuracy: 1 is perfect prediction.
print('Accuracy: %.3f' % tr.score(X_test, y_test))
# Cross-Validation accuracy
print('Cross-validation accuracy: %0.3f' % tr_cv.mean())
# Precision
print('Precision: %.3f' % precision_score(y_test, tr_pred))
# Recall
print('Recall: %.3f' % recall_score(y_test, tr_pred))
# f1 score: best value at 1 (perfect precision and recall) and worst at 0.
print('F1 score: %.3f' % f1_score(y_test, tr_pred))
# Predict probabilities for the test data.
tr_probs = tr.predict_proba(X_test)
# Keep Probabilities of the positive class only.
tr_probs = tr_probs[:, 1]
# Compute the AUC Score.
auc_tr = roc_auc_score(y_test, tr_probs)
print('AUC: %.2f' % auc_tr)
# Plot confusion matrix for Decision tree.
cm = ConfusionMatrix(tr, is_fitted=True)
cm.score(X_test, y_test)
cm.show()
# Get the ROC curves.
tr_fpr, tr_tpr, tr_thresholds = roc_curve(y_test, tr_probs)
# Plot the ROC curve.
plt.figure(figsize=(8, 8))
plt.plot(tr_fpr, tr_tpr, color='red',
         label='Decision tree ROC (AUC= %0.2f)' % auc_tr)
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend()
plt.show()
# Initialize a Multi-layer Perceptron classifier.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000, activation='logistic',
                    alpha=0.01, random_state=25, shuffle=True, verbose=False)
# Train the classifier.
mlp.fit(X_train, y_train)
# Make predictions.
mlp_pred = mlp.predict(X_test)
# 10-fold cross-validation score (computed on the full, unscaled data set).
mlp_cv = cross_val_score(mlp, X, y, cv=10)
# Accuracy: 1 is perfect prediction.
print('Accuracy: %.3f' % mlp.score(X_test, y_test))
# Cross-Validation accuracy
print('Cross-validation accuracy: %0.3f' % mlp_cv.mean())
# Precision
print('Precision: %.3f' % precision_score(y_test, mlp_pred))
# Recall
print('Recall: %.3f' % recall_score(y_test, mlp_pred))
# f1 score: best value at 1 (perfect precision and recall) and worst at 0.
print('F1 score: %.3f' % f1_score(y_test, mlp_pred))
# Predict probabilities for the test data.
mlp_probs = mlp.predict_proba(X_test)
# Keep probabilities of the positive class only.
mlp_probs = mlp_probs[:, 1]
# Compute the AUC Score.
auc_mlp = roc_auc_score(y_test, mlp_probs)
print('AUC: %.2f' % auc_mlp)
# Plot confusion matrix for the MLP.
cm = ConfusionMatrix(mlp, is_fitted=True)
cm.score(X_test, y_test)
cm.show()
# Get the ROC curves.
mlp_fpr, mlp_tpr, mlp_thresholds = roc_curve(y_test, mlp_probs)
# Plot the ROC curve.
plt.figure(figsize=(8, 8))
plt.plot(mlp_fpr, mlp_tpr, color='red', label='MLP ROC (AUC= %0.2f)' % auc_mlp)
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend()
plt.show()
# Metric names for the grouped bar chart (this name avoids shadowing the sklearn `metrics` import).
metric_names = ['Accuracy', 'CV accuracy', 'Precision', 'Recall', 'F1', 'ROC AUC']
# Plot metrics.
fig = go.Figure(data=[
    go.Bar(name='Logistic Regression', x=metric_names,
           y=[logreg.score(X_test, y_test), logreg_cv.mean(), precision_score(y_test, log_pred),
              recall_score(y_test, log_pred), f1_score(y_test, log_pred), auc_logreg]),
    go.Bar(name='Decision tree', x=metric_names,
           y=[tr.score(X_test, y_test), tr_cv.mean(), precision_score(y_test, tr_pred),
              recall_score(y_test, tr_pred), f1_score(y_test, tr_pred), auc_tr]),
    go.Bar(name='Neural Network', x=metric_names,
           y=[mlp.score(X_test, y_test), mlp_cv.mean(), precision_score(y_test, mlp_pred),
              recall_score(y_test, mlp_pred), f1_score(y_test, mlp_pred), auc_mlp])
])
fig.update_layout(title_text='Metrics for each model',
                  barmode='group', xaxis_tickangle=-45, bargroupgap=0.05)
fig.show()
# Plot the ROC curves of all three models together.
plt.figure(figsize=(8, 8))
plt.plot(mlp_fpr, mlp_tpr, color='green',
         label='MLP ROC (AUC= %0.2f)' % auc_mlp)
plt.plot(tr_fpr, tr_tpr, color='orange',
         label='Decision tree ROC (AUC= %0.2f)' % auc_tr)
plt.plot(logreg_fpr, logreg_tpr, color='red',
         label='LogReg ROC (AUC= %0.2f)' % auc_logreg)
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend()
plt.show()
d = {
    '': ['Logistic Regression', 'Decision Tree', 'Neural Network (MLP)'],
    'Accuracy': [logreg.score(X_test, y_test), tr.score(X_test, y_test), mlp.score(X_test, y_test)],
    'CV Accuracy': [logreg_cv.mean(), tr_cv.mean(), mlp_cv.mean()],
    'Precision': [precision_score(y_test, log_pred), precision_score(y_test, tr_pred), precision_score(y_test, mlp_pred)],
    'Recall': [recall_score(y_test, log_pred), recall_score(y_test, tr_pred), recall_score(y_test, mlp_pred)],
    'F1': [f1_score(y_test, log_pred), f1_score(y_test, tr_pred), f1_score(y_test, mlp_pred)],
    'ROC AUC': [auc_logreg, auc_tr, auc_mlp]
}
results = pd.DataFrame(data=d).round(4).set_index('')
results
|  | Error rate | Accuracy |
|---|---|---|
| Logistic Regression | 0.18 | 0.82 |
| Decision tree | 0.17 | 0.83 |
| Neural Network | 0.17 | 0.83 |
|  | Accuracy | F1 |
|---|---|---|
| Decision tree | 0.7973 | 0.4912 |
| Neural Network | 0.8227 | 0.4593 |